Decision Trees
Decision trees are tree-like models, or flowcharts, that split the data based on decisions, events, or their consequences.
The image below shows a graphical representation of how data is split in a decision tree.
It is also an example of a classification tree.
Classification tree

Tree Method
The data below is well suited to a tree method if your question is whether your friend will come out to play or not.

So a decision tree has the following elements:
* The root - the first split, at the top of our tree
* The nodes - splits on the value of a certain attribute; these are also called internal nodes
* The edges - the outcome of a split, leading to the next node
* The leaves - the end points of the tree; these terminal nodes predict the outcome
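The elements above can be seen in a fitted tree. Below is a minimal sketch using scikit-learn (assumed to be available); the toy "will my friend come to play?" data is made up for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [outlook (0 = sunny, 1 = rainy), temperature in Celsius]
X = [[0, 30], [0, 25], [1, 20], [1, 28], [0, 22], [1, 18]]
y = [1, 1, 0, 0, 1, 0]  # 1 = comes to play, 0 = does not

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# export_text prints the root, the internal nodes, and the leaves
print(export_text(clf, feature_names=["outlook", "temperature"]))
```

In this toy data the outlook alone separates the classes, so the fitted tree has a single split at the root and two leaves.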
Regression Tree
Below is an example of a regression tree.

How to grow a tree in the best manner?
Now let's consider the following example dataset 'D'.

From the dataset above we can see that the feature Y splits the classes cleanly, so it should be the root of our tree. Since this is a small dataset we can identify the splits by eye, but for a large dataset we use recursive binary splitting, also called the greedy approach, to find the best feature to split on.
Splits using the Greedy approach (Regression Tree)
The figure below gives us an idea of how the greedy approach finds the best splits.

Let's now see the steps to find the best split for our dataset and grow our regression tree:
* Let the cube represent the feature space of our dataset; in our example D we take the features X, Y, Z and plot the points.
* Now we have to find our first split, the root of our tree (the violet equation gives us the math). We choose the split such that the sum of the RSS (residual sum of squares) over the two resulting regions is minimum. In our example D, a split along the feature Y gives an RSS of zero, while X and Z give an RSS greater than zero, hence the greedy approach takes the split along Y as the root of our tree.
* After making our first split we get two regions, say R1 and R2; we apply the same step within R1 and R2 and continue to grow our tree.
* In the case of regression we can stop when a leaf region contains only one point, or five points, or whatever threshold is convenient.
* The predicted value is then the mean of the responses in the leaf region, (or)
* We can predict the value of y in each region by fitting a linear regression model there, validated by cross-validation.
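One step of the greedy search described above can be sketched in plain NumPy. The dataset here is hypothetical, built so that only feature index 1 (playing the role of Y) separates the response perfectly.

```python
import numpy as np

def best_split(X, y):
    """One greedy recursive-binary-splitting step for a regression tree:
    try every feature and every candidate cut point, and keep the split
    with the smallest total RSS over the two resulting regions."""
    best = (None, None, np.inf)  # (feature index, threshold, total RSS)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            # RSS of a region = sum of squared deviations from its mean
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, t, rss)
    return best

# Toy data: feature 1 ("Y") separates the response perfectly,
# so the greedy step picks it as the root split with RSS = 0.
X = np.array([[1, 0, 1],
              [0, 0, 1],
              [1, 1, 1],
              [0, 1, 0]], dtype=float)
y = np.array([5.0, 5.0, 9.0, 9.0])
print(best_split(X, y))  # feature index 1 is chosen
```

Growing the full tree would simply call `best_split` recursively on each resulting region until a stopping rule fires.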
Splits using the Greedy approach (Classification Tree, img. 1)

Let's now see the steps to find the best split for our dataset and grow our classification tree:
* The first split is taken on the feature for which the resulting regions have the lowest total error rate. In our example dataset D, a split along Y divides it into two pure regions, (1,1) and (0,0), so the total error rate is minimum (zero in our case).
* By following this method we can grow our binary classification tree using the greedy approach.
* We can stop growing the tree when each leaf contains points of a single class.
* We can then predict the class of a new data point by taking the majority vote in its leaf region.
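The classification variant of the greedy step can be sketched the same way, scoring each candidate split by its misclassification error. The dataset below is hypothetical, built to match the text: splitting on the feature in column 1 (Y) yields the pure regions (1,1) and (0,0).

```python
import numpy as np

def split_error(X, y, j):
    """Total misclassification error if we split on binary feature j:
    each region predicts its majority class, and the rest count as errors."""
    total = 0
    for side in (0, 1):
        labels = y[X[:, j] == side]
        if len(labels) == 0:
            continue
        majority = np.bincount(labels).max()
        total += len(labels) - majority
    return total

X = np.array([[1, 0, 1],   # columns play the roles of X, Y, Z
              [0, 0, 0],
              [1, 1, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])

errors = [split_error(X, y, j) for j in range(3)]
print(errors)  # the Y column (index 1) attains zero error
```

The greedy approach picks the feature with the smallest error as the root, then recurses into each region.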
Splits using the Greedy approach (Classification Tree, img. 2)

Generalization of classification trees
We know that the decision tree is predominantly used for classification.
* There are many measures through which we can determine the splits for our tree; the most common three are:
1) Error rate
2) Entropy
3) Gini index
In the equations, p with a subscript denotes the probability (proportion) of a class within a region.
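For a region with class proportions $\hat{p}_k$ (the fraction of points in the region belonging to class $k$), these three measures are commonly written as follows; this is the standard formulation, not taken verbatim from the figure above.

```latex
% Misclassification error rate of a region
E = 1 - \max_k \hat{p}_k

% Entropy (cross-entropy / deviance)
D = -\sum_{k} \hat{p}_k \log \hat{p}_k

% Gini index
G = \sum_{k} \hat{p}_k \,(1 - \hat{p}_k)
```

All three are smallest when the region is pure (one $\hat{p}_k$ equal to 1), which is why minimizing any of them favours clean splits.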

Bootstrap Aggregation (Bagging)
Definition from Wiki:
Why do we need Bootstrap Aggregation?
How does it achieve this?
If n′ = n, then for large n the set is expected to have the fraction (1 − 1/e) (≈ 63.2%) of the unique examples of D, the rest being duplicates.
This kind of sample is known as a bootstrap sample. Then, m models are fitted using the above m bootstrap samples and combined by averaging the output (for regression) or voting (for classification).
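The 63.2% figure is easy to check empirically: sample n items with replacement from a set of n and measure the fraction that are unique.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
sample = rng.integers(0, n, size=n)        # one bootstrap sample (n' = n)
unique_fraction = len(np.unique(sample)) / n
print(unique_fraction)                     # close to 1 - 1/e ≈ 0.632
```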
Here 'E' is the expected value,
Y is the predicted value for x from a single bootstrap model,
y is the true value for x, and
Z is the average used to aggregate the predicted values.

From the last equation of the model below we can see that the error for the average, E[(Z − y)²], depends on the number of bootstrap models m: assuming the models' errors are independent, the error tends to zero as m tends to infinity.
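This variance-reduction effect can be illustrated with a small simulation. Here each bootstrap model's prediction is modelled as the true value plus independent noise (an idealized assumption; real bootstrap models are correlated), so the squared error of the average Z shrinks roughly like sigma²/m.

```python
import numpy as np

rng = np.random.default_rng(1)
y_true, sigma, trials = 10.0, 2.0, 5_000

def mean_sq_error(m):
    # m independent model predictions per trial, averaged into Z
    preds = y_true + sigma * rng.normal(size=(trials, m))
    Z = preds.mean(axis=1)
    return ((Z - y_true) ** 2).mean()

for m in (1, 10, 100):
    print(m, mean_sq_error(m))   # the error falls as m grows
```

In practice the bagged models are fitted on overlapping bootstrap samples and are therefore correlated, so the error levels off rather than reaching zero; the simulation shows the idealized independent case the equation describes.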
